A Comparison of Techniques for Selecting Text Collections

Authors

  • Daryl J. D'Souza
  • James A. Thom
  • Justin Zobel
Abstract

Techniques for evaluating queries against a distributed text document database allow uniform access to separate collections in the database. One such technique is to first choose a subset of collections, via a selection index. The index captures information about each collection such as terms occurring in documents, term statistics, and collection statistics. A possible implementation of such an index is a lexicon, which maintains a complete list of terms in the database. Another approach is to partially index the database by extracting fewer terms but maintaining some information about each document. In this paper we explore three collection-ranking techniques, two based on lexicons and the other based on partial document indexes. Our experiments show that in most cases the lexicon approaches outperform the partial index approach.
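The lexicon-based selection index described in the abstract can be illustrated with a minimal sketch. It assumes each collection is a list of plain-text documents and stores, per collection, the document count and each term's document frequency. The scoring formula and the names `build_lexicon` and `rank_collections` are illustrative assumptions, not the ranking functions evaluated in the paper.

```python
import math
from collections import Counter


def build_lexicon(collections):
    """Map each collection name to (document count, term -> document frequency)."""
    lexicon = {}
    for name, docs in collections.items():
        df = Counter()
        for doc in docs:
            # Count each term once per document (document frequency).
            df.update(set(doc.lower().split()))
        lexicon[name] = (len(docs), df)
    return lexicon


def rank_collections(lexicon, query):
    """Return collection names ordered from most to least promising.

    Assumed score: for each query term, the fraction of documents in the
    collection containing it, weighted by how few collections hold the term.
    """
    terms = set(query.lower().split())
    n_collections = len(lexicon)
    scores = {}
    for name, (n_docs, df) in lexicon.items():
        score = 0.0
        for term in terms:
            # Number of collections whose lexicon contains this term.
            holders = sum(1 for _, other_df in lexicon.values() if term in other_df)
            if holders == 0 or n_docs == 0:
                continue
            score += (df[term] / n_docs) * math.log(1 + n_collections / holders)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


collections = {
    "news": ["stock market falls", "market rally today"],
    "sports": ["football match today", "tennis final"],
}
lexicon = build_lexicon(collections)
ranking = rank_collections(lexicon, "market")
```

Under this sketch, a query would then be evaluated only against the top-ranked collections rather than the whole database.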


Similar resources

Text Summarization Using Cuckoo Search Optimization Algorithm

Today, with the rapid growth of the World Wide Web and the creation of Internet sites and online text resources, text summarization has attracted considerable attention from researchers. Extractive text summarization is an important summarization method that consists of selecting the top representative sentences from the input document. When we are facing large data volume documents, the extr...


Comparison of Error Tree Analysis and TRIPOD BETA in Accident Analysis of a Power Plant Industry Using Hierarchical Analysis

Introduction: Due to the importance and necessity of accident analysis, it is necessary to use proper technique for precise accident analysis and to provide corrective and preventive measures to prevent recurrence of an accident. Method: In this descriptive-analytical paper, the most important criteria for investigating and selecting accident investigation and analysis techniques and selecting...


Sampling strategies for information extraction over the deep web

Information extraction systems discover structured information in natural language text. Having information in structured form enables much richer querying and data mining than possible over the natural language text. However, information extraction is a computationally expensive task, and hence improving the efficiency of the extraction process over large text collections is of critical intere...


Semantic Indexing of Multilingual Corpora and its Application on the History Domain

The increasing amount of multilingual text collections available in different domains makes its automatic processing essential for the development of a given field. However, standard processing techniques based on statistical clues and keyword searches have clear limitations. Instead, we propose a knowledge-based processing pipeline which overcomes most of the limitations of these techniques. T...


Report on the TREC-8 Experiment: Searching on the Web and in Distributed Collections

The Internet paradigm permits information searches to be made across wide-area networks where information is contained in web pages and/or whole document collections such as digital libraries. These new distributed information environments reveal new and challenging problems for the IR community. Consequently, in this TREC experiment we investigated two questions related to information searches...




Publication date: 2000